Since their first introduction in 2015, the use of deep reinforcement learning (DRL) schemes has grown considerably. Although they have been applied in many different settings, they still suffer from a lack of explainability. This lack of explainability hinders the adoption of DRL solutions by both researchers and the general public. To address this problem, the field of explainable artificial intelligence (XAI) has emerged. It comprises a variety of methods that aim to open the DRL black box, ranging from interpretable symbolic decision trees to numerical methods such as Shapley values. This review examines which methods are being used and in which applications, in order to determine which models are best suited to each application and whether any methods are currently underused.
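To make the Shapley-value attributions mentioned above concrete, here is a minimal Monte Carlo sketch. The setup is an assumption, not taken from the review: `score` stands in for any scalar output of a trained agent (e.g., the Q-value of the chosen action), and a baseline vector represents "absent" features.

```python
# Minimal Monte Carlo Shapley attribution for one state of a hypothetical DRL agent.
# `score` and `baseline` are illustrative assumptions, not a specific agent from the review.
import numpy as np

rng = np.random.default_rng(0)

def score(state: np.ndarray) -> float:
    # Placeholder for the agent's value/policy head (assumed for illustration).
    return float(1.5 * state[0] - state[2] ** 2 + 0.3 * state[3])

def shapley_values(state, baseline, n_samples=2000):
    """Estimate each feature's Shapley contribution to score(state) - score(baseline)."""
    d = len(state)
    phi = np.zeros(d)
    for _ in range(n_samples):
        order = rng.permutation(d)      # random feature ordering
        current = baseline.copy()
        prev = score(current)
        for j in order:                 # add features one at a time
            current[j] = state[j]
            new = score(current)
            phi[j] += new - prev        # marginal contribution of feature j
            prev = new
    return phi / n_samples

state = np.array([1.0, -0.5, 0.8, 2.0])
baseline = np.zeros_like(state)
print(shapley_values(state, baseline))  # per-feature attributions
```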
In many online decision processes, an optimizing agent is asked to choose among a large number of alternatives that share many inherent similarities. In turn, these similarities imply losses that may confound standard discrete-choice models and bandit algorithms. We study this question in the context of nested bandits, a class of adversarial multi-armed bandit problems in which the learner seeks to minimize their regret in the presence of a large number of distinct alternatives with an embedded (non-combinatorial) hierarchy of similarities. In this setting, optimal algorithms based on the exponential-weights blueprint (e.g., Hedge, EXP3, and their variants) may incur significant regret, because they tend to spend excessive amounts of time exploring irrelevant alternatives with similar, suboptimal costs. To address this, we propose a nested exponential weights (NEW) algorithm that performs a layered exploration of the learner's alternatives via a nested, step-by-step selection method. In this way, we obtain a series of tight regret bounds showing that the learner can efficiently solve online learning problems with a high degree of similarity between alternatives, without running into the red bus / blue bus paradox.
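For reference, the sketch below implements the flat exponential-weights baseline (EXP3) that the abstract contrasts with its nested variant; it is not the paper's NEW algorithm. The nested method would replace the single flat sampling distribution with a level-by-level draw over similarity classes. Losses are assumed to lie in [0, 1], and the toy adversary is an assumption for illustration.

```python
# Minimal EXP3 sketch (flat exponential weights) for an adversarial bandit with losses in [0, 1].
import numpy as np

def exp3(loss_fn, n_arms, horizon, eta=0.1, gamma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.ones(n_arms)
    total_loss = 0.0
    for t in range(horizon):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = rng.choice(n_arms, p=probs)
        loss = loss_fn(t, arm)                 # only the chosen arm's loss is observed
        total_loss += loss
        est = np.zeros(n_arms)
        est[arm] = loss / probs[arm]           # importance-weighted loss estimate
        weights *= np.exp(-eta * est)          # exponential-weights update
    return total_loss

# Toy adversary (assumed): arm 0 has slightly lower loss than the rest.
losses = lambda t, a: 0.4 if a == 0 else 0.6
print(exp3(losses, n_arms=10, horizon=5000))
```

With many near-identical suboptimal arms, this flat scheme spreads exploration over all of them, which is exactly the inefficiency the nested construction is designed to avoid.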
Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous action case raises new challenges about modelling, optimization, and offline model selection with real data, which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modelling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
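As a point of reference for the counterfactual objective underlying CRM, the sketch below evaluates a clipped inverse-propensity (IPS) estimate of a new policy's risk on logged data with continuous actions, using a simple Gaussian policy whose mean is linear in the context. The paper's joint kernel embedding and proximal-point optimization are not reproduced here; all names, shapes, and the toy data are illustrative assumptions.

```python
# Minimal clipped IPS risk estimate for off-policy learning with continuous actions.
import numpy as np

def gaussian_logpdf(a, mean, sigma):
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def ips_risk(theta, contexts, actions, logging_propensities, losses,
             sigma=0.5, clip=10.0):
    """Clipped IPS estimate of the new policy's expected loss on logged data."""
    means = contexts @ theta                        # assumed linear-in-context action mean
    new_density = np.exp(gaussian_logpdf(actions, means, sigma))
    weights = np.minimum(new_density / logging_propensities, clip)
    return float(np.mean(weights * losses))

# Toy logged data (assumed): logging policy is N(0, 1), loss rewards tracking x[0].
rng = np.random.default_rng(0)
n, d = 1000, 3
contexts = rng.normal(size=(n, d))
actions = rng.normal(size=n)
logging_propensities = np.exp(gaussian_logpdf(actions, 0.0, 1.0))
losses = (actions - contexts[:, 0]) ** 2

theta = np.zeros(d)
print(ips_risk(theta, contexts, actions, logging_propensities, losses))
```

Minimizing such an estimator over the policy parameters (here, `theta`) is the basic CRM recipe; clipping the importance weights is one standard way to control the variance that logged propensities introduce.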